(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets          (11) EP 0 926 594 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
     30.06.1999 Bulletin 1999/26

(51) Int. Cl.6: G06F 9/45

(21) Application number: 97310249.4

(22) Date of filing: 17.12.1997



(84) Designated Contracting States: 

AT BE CH DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

Designated Extension States:
AL LT LV MK RO SI

(71) Applicant:
Hewlett-Packard Company
Palo Alto, California 94304 (US)

(72) Inventors:

• Solomon, Charles Reed
Bristol BS9 4RU (GB)

• Olgiati, Andrea
Bristol BS16 1AD (GB)

(74) Representative:

Lawrence, Richard Anthony et al
Hewlett-Packard Limited,
IP Section,
Building 2,
Filton Road
Stoke Gifford, Bristol BS12 6QZ (GB)



(54) Method of using primary and secondary processors 



(57) The invention relates to the compilation of source code to a primary and a secondary processor. It relates to reconfigurable secondary processors, and is especially relevant to secondary processors which can be reconfigured to some degree during execution of code. Selective extraction of dataflows from the source code is followed by transformation of the extracted dataflows into trees. The trees are then matched against each other to determine minimum edit cost relationships for transformation of one tree into another, where these minimum edit cost relationships are determined by the architecture of the secondary processor. A group or a plurality of groups of dataflows is determined on the basis of said minimum edit cost relationships and for each group a generic dataflow capable of supporting each dataflow in that group is created. The generic dataflow or dataflows is then used to determine the hardware configuration of the secondary processor; and calls to the secondary processor for said group or plurality of groups of dataflows are substituted into the source code. The resultant source code is compiled to the primary processor.

The resulting efficient configuration thus reduces either the expense of reconfiguration (in a field programmable array), or the silicon area (in an application specific integrated circuit).



Description



[0001] The present invention [...] consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not exclusively, relevant to architectures employing a [...]

[0002] A primary processor - such as a Pentium processor in a conventional PC (Pentium is a Trade Mark of Intel Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without being optimised for any of them. Such a processor is thus not optimised to handle efficiently computationally intensive operations, such as parallel [...]

[0003] One approach to solve this problem is the development of integrated circuits specifically adapted for particular applications. These are known as ASICs, or application-specific integrated circuits. Tasks for which such an ASIC is [...] are generally performed very well: however, the ASIC will generally perform poorly, if at all, on tasks for which [...] ASIC can be built for a particular application, but this is not a desirable [...]

[0004] Reconfigurable coprocessors [...] particular tasks run inefficiently by the primary processor can be extracted and run more [...] secondary processors, the possibility of improving performance by extracting difficult code to a custom coprocessor [...] To obtain the desired efficiency [...] necessary to determine as effectively as possible how code is to be [...]

[...] IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, [...] extract initially, [...]

[...] to assess code to determine which are the most appropriate elements to direct [...] Becker and Rainer Kress, in Int. IEEE Symposium on Engineering of Computer Based Systems (ECBS) [...] allocation to a coprocessor and which should be reserved for the primary processor. This is [...] that the extracted code can be mapped to the coprocessor. This approach does expand the usage of secondary processors, but does not [...]

[...] approaches have been proposed in the BRASS research project at the University of Berkeley. An [...] on Reprogrammable Custom Computing Machines, April 16-18 1997, Napa, [...] available on the World Wide Web at [...] uses structures representative of a [...] in the mapping of source code on to FPGA structures. Source code samples are rendered as directed acyclic [...] to trees. These and other basic graph concepts are set out, for example, in [...]

[...] defined by a pair of nodes (and can be [...] either directed or undirected: in a directed graph [...] nodes.

[0009] In the work of Tim Callahan & Jo[...] "tree covering" program called Iburg. Iburg is a generally available software tool, and its application is described in "A Retargetable C Compiler: Design and Implementation", Christopher W. Fraser and David R. Hanson, Benjamin/Cummings Publishing Co., Inc., Redwood City, 1995, especially at pp 373-407. Iburg takes as input the source code trees and partitions this input into chunks that correspond to instructions on the target processor. This partition is termed a tree cover. This approach is essentially determined by the user-defined patterns allowable for a chunk, and is relatively complex: it involves a bottom-up matching of a tree with patterns, recording all possible matches, followed by a top-down reduction pass to determine which match of patterns provides the lowest cost. Again, this approach requires a significant initial constraint in the form of the predefined set of allowable patterns, and does not fully realise the possibilities of a reconfigurable architecture.

[0010] There is thus a need to develop techniques and approaches to further improve computational efficiency of systems involving a primary and secondary processor, by which an optimal choice can be made for allocation of code to a secondary processor, which can then be configured as efficiently as possible to run the extracted code, with a view to maximising the performance efficiency of the primary and secondary processor system in ex[...]

[0011] Accordingly, the invention provides a method of compiling source code to a primary and a secondary processor, comprising: selective extraction of dataflows from the source code; transformation of the extracted dataflows into trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships and creating for each group a generic dataflow capable of supporting each dataflow in that group; using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor and substituting into the source code calls to the secondary processor for said group or plurality of groups of dataflows; and compiling the resultant source code to the primary processor.

[0012] This approach allows for optimal selection of source code dataflows for allocation to the secondary processor without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full account of the demands and requirements of the secondary processor architecture. Advantageously, said minimum edit cost relationships are determined according to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration of the secondary processor. The method is particularly effective if the minimum edit cost relationships are embodied in a taxonomy of minimum edit distances for classification of the trees.
[0013] The method finds its most useful application where the hardware configuration of the secondary processor allows for reconfiguration of the secondary processor during execution of the source code, as this allows for reconfiguration of the secondary processor to be required during execution of the source code to support each dataflow in the group supported by a generic dataflow. The secondary processor may thus be an application specific instruction processor [...] be a field programmable gate array or a field programmable arithmetic array (such as that shown in the CHESS architecture discussed in Appendix A).

[0014] [...] generic dataflow for a group is [...] an approximate mapping of dataflows in the group on to each other, followed by a merge operation.

[0015] An advantageous approach to construction [...] directed acyclical graphs and reduce them to trees by removal of any links in the directed acyclical graphs not present in a critical path between [...] which passes through the largest number of intermediate nodes. Alternative criteria to the critical path can be adopted [...] secondary processor hardware (for example, if a different criterion can be found which is more sensitive to the timing of operations in the secondary processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic dataflow is compared with further dataflows extracted from the source code, wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the generic dataflow. This enables more or all of the code present in the source code which is suitable for allocation to the secondary processor to be so allocated.

[0017] In the approaches indicated above, the removed links are stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the merging of the trees of the group [...]

[0018] Specific embodiments of the invention are described below, by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be [...]

Figure 2 shows schematically a method of compiling source code to a primary and a secondary processor according to an embodiment of the invention;

Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method step according to one embodiment of the invention;

Figure 4a illustrates the step of insertion and deletion of nodes and Figure [...] nodes in a tree matching process employed in a method step [...]

Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;

Figure 6 illustrates a generic dataflow provided in an example according to one embodiment of the invention;

Figure 7 shows a logical interface for allocation of secondary processor resources for a generic dat[...] to an embodiment of the invention;

Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and

Figures 9a to 9d show an illustration of the merging of [...] according to an embodiment of the invention.

[0019] The present invention [...] a secondary processor. An example of such an architecture is shown in Figure 1. [...] conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary processor 2, 4 is adapted to [...] source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a dedicated coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this coprocessor 4 will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors are not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for a specific function, but is instead configurable to enable improved [...] by the [...] processor. The secondary processor 2 is advantageously an application specific structure: it can be a conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a secondary processor can be configured for high computational efficiency in handling desired parts of the source code for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, acc[...] appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output channel 5 here represents all further channels and hardware necessary to enable the user to interact with [...] (for example, by program[...]

[0021] The present invention is particularly relevant to the optimal partitioning of source code between primary processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise the handling of the application embodied in the source code by the architecture. A significant contribution is made by the invention in the selection and extraction of code for use in the secondary processor.

[0022] [...] process is a body of source code. In principle, this can be in any language: the example described was carried out on C code, but the person skilled in the art will readily understand how the techniques described could be adopted with other languages. For example, the source code could be Java byte code if Ja[...] of Figure 1 [...] adapted to directly receiving and executing source code received from the internet.

[0023] [...] the secondary processor 2. Typically, this is done by performing dataflow analysis on the source code [...] representations of dataflows presented by selected lines of code (in most processes, this [...]

[0024] The approach taken here is to build directed acyclical graphs (DAGs) which represent the dataflows of selected [...] way to do this is by using a compiler infrastructure appropriately configured for the [...] An appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented [...] World Wide Web at [...] and elsewhere. SUIF is devised for [...] including systems comprising more than one processor. A standard SUIF [...] convert C to SUIF. It is then a simple process for one skilled in the art to use SUIF tools to [...]

[0025] [...] extraction of DAGs from source code is a conventional step. The next step in the process, as can be seen [...] these DAGs [...] This is a significant [...] manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping to hardware to be retained, while simplifying [...] to allow analytical approaches to be made significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing" (as cited above), especially at pages 56 to 60. Different terminology is used here from that used in the cited reference, but equivalent and comparable terms are indicated below. The type of trees constructed here are directly comparable to the "spanning trees" referred to in the cited reference.

[0027] The preferred approach followed in the reduction of DAGs to trees is the removal of links not in the critical path between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is in a first embodiment of this reduction process defined as the one that touches the maximum number of nodes. As a DAG is, by definition, acyclic, distinct paths can be defined to meet this criterion. It is possible for there to be different paths between nodes that have the same maximum number of nodes, but these paths are likely all to be satisfactory for the purpose of tree construction. While making an arbitrary selection between these paths is a valid approach, a key issue in mapping the source code successfully is scheduling, which depends on timing information: accordingly, where it is necessary to make a choice between alternative "critical paths" it is desirable to choose the one that would take the longest time (in terms of time taken to execute each of the operations represented by the nodes in the path). As is discussed further below, alternative approaches can be adopted which are based more directly on timing information. It is also desirable to adopt a consistent approach in making such choices - otherwise morphologically different trees can result from essentially similar DAGs.

[0028] The process taken in applying this first embodiment of the critical path criterion is as follows. Firstly, for every leaf node, every possible path towards the root is chased: as the DAG is a directed graph, this is straightforward. As indicated above, for each leaf node the path with the greatest number of nodes is chosen, and if two paths are found to have the same number of nodes, a selection is made. This is the critical path for that leaf node. All other paths not selected are cut in their edge closest to the starting point. This cut edge is termed a minor link (equivalent to the term "cross-link" in the Wolfe reference). The tree consists of the assembly of critical paths, and contains no minor links. The minor links are stored separately. Minor links will be required when extracted source code is mapped to secondary processor 2, but are not used in determining which source code is to be mapped to the secondary processor.
[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical path does provide particular advantages. In particular, removal as minor links of the cross-links not in the critical path will have little effect on scheduling, whereas if another approach was adopted removed cross-links may have a considerable influence on timing and hence on scheduling. Use of the critical path criterion allows construction of a tree which represents as best possible the critical features of the DAG in the context of mapping to hardware.
[0030] Figure 3 shows the application of the process described in the preceding paragraph. Source code extract 11 shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code represented as a directed acyclical graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.
[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124 and 125. This is then the critical path from leaf node 129 to root node 126, and will be present in the tree. From node 121 (in the present case the result of an earlier operation and designated c), there are two paths: the first path passes through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path is the critical path, as it passes through more nodes: the second path can thus be cut, as is discussed below. The remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125, whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.
[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point which is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through nodes 127, 128 and 125. This can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125: this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
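
The reduction described in paragraphs [0028] to [0032] can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the dictionary representation of the DAG (each node mapped to the nodes it feeds, towards the root) and the tie-break rule (prefer the larger node-number sequence) are all assumptions introduced here. The edge list is transcribed from the description of DAG 12 above.

```python
def paths_to_root(succ, node, root):
    """Yield every path, as a tuple of nodes, from `node` up to `root`."""
    if node == root:
        yield (node,)
        return
    for parent in succ.get(node, ()):
        for rest in paths_to_root(succ, parent, root):
            yield (node,) + rest


def reduce_dag_to_tree(succ, leaves, root):
    """Return (tree_edges, minor_links) using the node-count critical path."""
    # Critical path per leaf: the path touching the most nodes. Ties are broken
    # by the larger node tuple (an assumed rule, chosen only to be consistent).
    critical = {
        leaf: max(paths_to_root(succ, leaf, root), key=lambda p: (len(p), p))
        for leaf in leaves
    }
    # The tree is the assembly of all critical paths.
    tree_edges = set()
    for path in critical.values():
        tree_edges.update(zip(path, path[1:]))
    # Each non-critical path is cut at its edge closest to the leaf that is
    # not also part of a critical path; that edge is kept as a minor link.
    minor_links = set()
    for leaf in leaves:
        for path in paths_to_root(succ, leaf, root):
            if path == critical[leaf]:
                continue
            for edge in zip(path, path[1:]):
                if edge not in tree_edges:
                    minor_links.add(edge)
                    break
    return tree_edges, minor_links


# DAG 12 of Figure 3, as described in the text: edges point towards root 126.
FIGURE_3_DAG = {
    129: [122], 121: [122, 127], 130: [123, 127],
    122: [123], 123: [124], 124: [125],
    127: [128], 128: [125], 125: [126],
}

tree_edges, minor_links = reduce_dag_to_tree(FIGURE_3_DAG, [121, 129, 130], 126)
print(sorted(minor_links))
```

With the assumed tie-break, the two edges cut are those between nodes 121 and 127 and between nodes 130 and 123, as in the example above; a different tie-break rule would select the other path for leaf node 130.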
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines

    if (x<2)
        a = b
    [...]
        a = [...]

and shows a multiplexer node 188 and a "less than" operation node 186 in addition to the variable and integer nodes 181, 182, 183 and 184. As the skilled man will appreciate, it will generally be possible to use the approach shown here for source code which can be represented as a DAG.
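
As an illustration of how a conditional becomes a pure dataflow, the following Python sketch models a multiplexer node fed by a "less than" node. The tuple encoding of nodes and the `evaluate` function are assumptions made for this sketch, and the second multiplexer input (the variable named `c` below) is hypothetical, since the corresponding operand is not legible in the source lines.

```python
def evaluate(node, env):
    """Evaluate a dataflow node given variable values in `env`."""
    kind = node[0]
    if kind == "var":
        return env[node[1]]
    if kind == "const":
        return node[1]
    if kind == "lt":
        return evaluate(node[1], env) < evaluate(node[2], env)
    if kind == "mux":
        select, if_true, if_false = node[1:]
        # As in hardware, both data inputs are computed; the select input
        # only decides which of them reaches the output.
        true_value = evaluate(if_true, env)
        false_value = evaluate(if_false, env)
        return true_value if evaluate(select, env) else false_value
    raise ValueError("unknown node kind: %r" % (kind,))


x = ("var", "x")
two = ("const", 2)
b = ("var", "b")
c = ("var", "c")  # hypothetical second input, not taken from the source

# Dataflow for: a = b when x < 2, otherwise a = c
a = ("mux", ("lt", x, two), b, c)

print(evaluate(a, {"x": 1, "b": 10, "c": 20}))
print(evaluate(a, {"x": 3, "b": 10, "c": 20}))
```

Because the multiplexer is an ordinary node with three inputs, the resulting graph remains acyclic and can be reduced to a tree by the same critical path procedure as any other dataflow.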

[0034] The tree structure that is left - in this case, tree 14 - is a much easier structure to use in determining which source code should be mapped to secondary processor 2, as is discussed further below. The technique described above is a particularly appropriate one for converting DAGs to trees, as it is straightforward to implement, is general in application, and through use of the critical path maintains the maximum "depth" of the computational engine to be synthesised (assuming each node represents a single computational element) because of the inclusion of paths with the [...] edges are to be removed in converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to tree reduction process is to assign a timing-based weight to every node (based, for example, on the length of time [...] corresponding computational element) and then to compare the accumulated weights of each path, selecting a path to define the tree accordingly on the basis of, for example, greatest accumulated weight. This [...] and in particular [...] timing dependencies are not mainly related to the nodes counted (which may be the case in structures where, for example, multiplication is several times more time consuming th[...]
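
The difference between the node-count criterion and the timing-weighted alternative can be shown with a short Python sketch. The operation names, the weights and the two candidate paths below are hypothetical values chosen for illustration; only the selection rule (greatest accumulated weight) comes from the text.

```python
# Hypothetical timing weights: a multiplication is taken to be four times
# as time consuming as an addition.
WEIGHT = {"add": 1, "mul": 4}

# Hypothetical operation performed at each node.
OPERATION = {"n1": "add", "n2": "add", "n3": "add", "n4": "mul", "n5": "mul"}

# Two candidate paths from the same leaf to the root (interior nodes only).
CANDIDATE_PATHS = [("n1", "n2", "n3"), ("n4", "n5")]


def accumulated_weight(path):
    """Sum the timing weights of the operations along a path."""
    return sum(WEIGHT[OPERATION[node]] for node in path)


def critical_by_count(paths):
    """First embodiment: the path touching the most nodes."""
    return max(paths, key=len)


def critical_by_weight(paths):
    """Alternative embodiment: the path with greatest accumulated weight."""
    return max(paths, key=accumulated_weight)


print(critical_by_count(CANDIDATE_PATHS))
print(critical_by_weight(CANDIDATE_PATHS))
```

Here the three-addition path wins on node count, but the two-multiplication path has the greater accumulated weight (8 against 3) and so would be kept under the timing-based criterion.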

[0035] The [...] the compilation process, as can be seen from Figure 2, takes trees as inputs and determines the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process comprises a series of sub-steps. The first of these is the analysis and classification of the trees resulting [...] candidate dataflows. This is a significant [...]

[0036] The objective in this stage of the compilation process is to determine as best possible which of the candidate [...] source code would be the best choices for execution by the secondary processor. This is to a large degree dependent on the nature of the hardware [...] code to the secondary processor 2 can be made where dataflows are sufficiently similar that broadly the same hardware representation can be used for each dataflow. It therefore follows that good choices of candidate [...] mapping to the secondary processor can be made by finding sets of dataflows that are sufficiently similar to each other. This is what is achieved by analysing and classifying the trees resulting from the candidate dataflows.

[0037] [...] for matching [...] of the invention, is the tree matching algorithm [...] by Kaizhong Zhang of the [...]

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labelled Trees", [...] 222, Springer-Verlag [...] a toolkit by the University of W[...] the toolkit being [...] the time of [...]

[0039] [...] approaches to matching trees to determine a [...] therebetween are [...] The principle of operation of Zhang's algorithm is the following: two trees are compared node-by-node through a dynamic programming technique that minimises the edit operations required to transform one tree into another. This cost of transformation is termed here an edit cost. The edit costs of successively larger subtrees are [...] compared, a record being kept of the minimum costs found. The computational structure can be characterised as [...]

[0040] [...] edit operations are insertion, deletion and substitution. These are shown in Figures 4a and 4b. [...] tree 151 with five and tree 152 with six nodes. The structure of the trees can be made identical by addition of a node between nodes 3 and 5 of tree 151: this new node gives the structure of tree 152. [...] tree 151 to tree 152 is achieved by insertion of this node, and transformation [...] architecture described [...] A, "deletion" is represented [...] hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case a [...] In Figure 4b, the two trees 151 and 152 have the same structure, but the two nodes 4 represent a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the various types of node possible.
[0041] As previously indicated, each of these edit operations has a co[...] for example, the same result may be achieved in some architectures either by an insertion and a deletion or by a substitution: the costs of these different alternatives can be compared.

[0042] The result of the comparison of two trees by this algorithm is the production of a list of pairs of nodes (t1, t2) where t1 belongs to the first tree and t2 belongs to the second tree. Each pairing constitutes an identification of similar points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the second tree, each node t1 has to be substituted with the respective t2. Nodes that do not occur in the mapping must be either inserted or deleted depending on which tree they belong to, as is discussed further below. For this list of pairs there will be defined an edit distance: this is the minimum in edit costs cumulated over the pairs necessary to transform one tree to the other. The algorithm is devised to determine an edit distance between two trees, together with the set of transformations which achieves that edit distance: alternative transformations will be possible but they will have a higher associated cumulative edit cost.

[0043] The value of computing an edit distance based on edit costs is that the edit costs may be chosen to represent the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a configuration representing the other tree in a mapping. This "hardware cost" is typically a measure of the quantity of secondary processor resources that will be taken up to achieve the second configuration given the existence of the first - this can be considered, for example, in terms of the additional area of device used. These costs will be determined by the nature of the secondary processor hardware, as for different types of hardware the physical realisation of insertion, deletion and substitution operations will be different. For the reconfigurable CHESS array discussed in Appendix A, a "bypass" operation involves minimal cost, a substitution between an adds and subs (addition and subtraction operations) has low cost, whereas substitution between mults and divs (multiplication and division operations) is expensive.
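
A minimal Python sketch of how architecture-dependent edit costs accumulate over a given list of node pairs is set out below. It does not implement Zhang's algorithm, which searches for the mapping of minimum cost; it only evaluates the cost of a mapping that is already known. All numeric costs, operation labels and function names are assumptions chosen to reflect the ordering described above (bypass cheapest, add/sub substitution cheap, mul/div substitution expensive).

```python
# Hypothetical cost table for a CHESS-like array (values are illustrative).
DELETE_COST = 1          # "deletion" realised as bypass of a unit
INSERT_COST = 4          # an additional unit must be allocated
LOW_SUBSTITUTION = 2     # e.g. add <-> sub
HIGH_SUBSTITUTION = 20   # e.g. mul <-> div
DEFAULT_SUBSTITUTION = 8


def substitution_cost(label1, label2):
    """Cost of turning a node labelled label1 into one labelled label2."""
    if label1 == label2:
        return 0
    pair = frozenset((label1, label2))
    if pair == frozenset(("add", "sub")):
        return LOW_SUBSTITUTION
    if pair == frozenset(("mul", "div")):
        return HIGH_SUBSTITUTION
    return DEFAULT_SUBSTITUTION


def mapping_cost(labels1, labels2, pairs):
    """Cumulative edit cost of transforming tree 1 into tree 2.

    labels1 / labels2 map node identifiers to operation labels;
    pairs is the list of (t1, t2) node pairs of the mapping.
    """
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    cost = sum(substitution_cost(labels1[t1], labels2[t2]) for t1, t2 in pairs)
    # Unmapped nodes of the first tree are deleted (bypassed) ...
    cost += DELETE_COST * len(set(labels1) - mapped1)
    # ... and unmapped nodes of the second tree are inserted.
    cost += INSERT_COST * len(set(labels2) - mapped2)
    return cost


first = {1: "add", 2: "mul", 3: "add"}
second = {1: "sub", 2: "mul", 3: "add", 4: "add"}
print(mapping_cost(first, second, [(1, 1), (2, 2), (3, 3)]))
```

In this hypothetical case one add/sub substitution and one insertion are needed, so the cumulative cost is the sum of the low substitution cost and the insertion cost.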
[0044] As indicated above, an edit distance between two trees can be constructed. However, a further step can be taken: using Zhang's algorithm, or a comparable approach, a taxonomy can be built to show the edit distances between each one of a set of trees. This taxonomy can readily be provided in the form of a tree, of which an example is shown in Figure 5. Each leaf node 161 of the tree represents a candidate tree extracted from a DAG, and each intermediate node 162 represents an edit cost. The tree provides a unique path between each pair of leaf nodes. The edit distance between the two leaf nodes of a pair is found by summation of costs provided at each intermediate node on this path. For example, the edit distance between any pair of the leaf nodes representing Tree#4, Tree#5 or Tree#6 is 6. However, the edit distance between Tree#1 and Tree#4 is 496: the sum [...] and 6.

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a taxonomy is a valuable tool, as it can be used heuristically as a metric for the degree of variation between candidate trees. The creation of a taxonomy thus renders it easy to determine which trees are sufficiently similar to be consolidated together (as will be discussed below), and which are too diverse for this purpose. This can be done by imposition of an edit distance threshold. A group of trees can be selected for consolidation if the edit distance between each and every possible pair of trees in the group is less than the edit distance threshold. The value of the edit distance threshold is arbitrary, and can be chosen by the person skilled in the art in the context of specific primary and secondary processors in order to optimise the performance of the system.
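
The threshold test can be sketched in Python as follows. The greedy first-fit grouping used here is an assumption made for brevity and is not guaranteed to find the best possible grouping; the function name and the threshold value of 10 are likewise illustrative. The distances 6 and 496 are those quoted for Figure 5, with the further assumption that Tree#1 is equally distant from Tree#5 and Tree#6.

```python
def group_by_threshold(names, distance, threshold):
    """Place each tree in the first group whose every member is closer
    than `threshold`; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if all(distance[frozenset((name, other))] < threshold
                   for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


NAMES = ["Tree#1", "Tree#4", "Tree#5", "Tree#6"]

# Pairwise edit distances, keyed by unordered pair of tree names.
DISTANCE = {}
for i, first_name in enumerate(NAMES):
    for second_name in NAMES[i + 1:]:
        far_apart = "Tree#1" in (first_name, second_name)
        DISTANCE[frozenset((first_name, second_name))] = 496 if far_apart else 6

print(group_by_threshold(NAMES, DISTANCE, threshold=10))
```

With these values Tree#4, Tree#5 and Tree#6 fall into one group, every pair being at distance 6, while Tree#1 remains on its own.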

[0046] The advantage of consolidating a group of trees is that a common [...] group [...] support the function [...] This is particularly appropriate for architectures, such as CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor. Reconfiguration is required to change the configuration from that to support the function of one tree to that to support the function of another tree: however, as the edit distance between these trees will never be greater than the edit cost threshold, the degree of reconfiguration required is already known to be within acceptable bounds. The group of trees are consolidated together by construction of a "supertree" which contains a representation of every component tree. After it has been constructed, the supertree can be converted into a representation of each of the relevant DAGs extracted from the source code by reinsertion of the previously removed minor links. The hardware configuration may then be determined from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the construction of a supertree from a group of trees which fall below the specified edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be mapped together into supertree 170. The reconfiguration required to change the hardware configuration from that to support, for example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the edit distance between the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the 
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements: 
merge:

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree: if there are trees
with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.
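As an illustration of this selection step, the following sketch picks the largest tree of a class as the initial merge tree. The tuple representation `(label, [children])` and the function names are assumptions made for the example only; ties are resolved by taking the first largest tree encountered, which is one possible arbitrary selection.

```python
def count_nodes(tree):
    """Count the nodes of a tree given as (label, [children])."""
    _label, children = tree
    return 1 + sum(count_nodes(child) for child in children)


def split_class(trees):
    """Return (merge_tree, source_trees) for a class of trees.

    The tree with the most nodes becomes the initial merge tree; on a tie the
    first such tree is taken.
    """
    index = max(range(len(trees)), key=lambda i: count_nodes(trees[i]))
    return trees[index], trees[:index] + trees[index + 1:]


small = ("A", [("B", [])])
large = ("A", [("B", []), ("C", [("D", [])])])
merge_tree, source_trees = split_class([small, large])
print(count_nodes(merge_tree), len(source_trees))
```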
[0050] For each source tree the following operation is performed:

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
from Zhang's algorithm and edit costs determined from the secondary processor architecture), the supertree is
constructed as follows:

1. Firstly, mapped nodes closest to the root are considered;

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation);

3. For each child operation of the source operation:

a. If the child is mapped, revert to step 2 with respect to the source child;

b. If the child is not mapped, then consider whether there is any mapping in the subtree of which the child
is the root (source subtree):

i. If there is no further mapping, simply adopt the source subtree for merging into the merge tree under
the corresponding merge tree node;

ii. If there is a further mapping inside the source subtree, [...]

a. If the merge operation of this subordinate mapping does not fall within the previously mapped subtree,
remove the mapped source operation from the source tree. There is recursion present at this
stage - where mapped children have already been dealt with, all that needs to be done is to
remove what would otherwise be a cross tree link.

b. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
the previously mapped subtree, climb up the merge tree until the least common ancestor for all
contained subordinate mappings is found. The least common ancestor [...]
all of the source mappings. The unmapped source segment is then mapped into the merge tree
by linking the source operation of the unmapped source subtree as a child of the least common
ancestor's parent, and by linking the least common ancestor as the child of the unmapped
operation just above the closest mapped source operation (the closest
mapped source operation delimits the lower end of an unmapped segment of the source tree
and is a mapped node which falls within the subtree of the current mapping - the source node's
parent, which is unmapped, adopts the merge tree's least common ancestor as a child and vice
versa).

The pair of intermingled trees are normalised into a single tree, which [...]
The procedure continues until all the source trees in the class are contained within the merge tree, which is now a
supertree.
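The simpler cases of the procedure above (steps 1, 2, 3.a and 3.b.i) can be sketched as follows. This is not the merge code of Appendix B: the `Node` class, the `mapping` dictionary from source nodes to merge tree nodes, and the function names are hypothetical, and the least common ancestor handling of step 3.b.ii is deliberately left out.

```python
class Node:
    """Tree node holding the operations concatenated at this position."""

    def __init__(self, op, children=None):
        self.ops = [op]
        self.children = list(children or [])


def contains_mapping(node, mapping):
    """True if the node or any of its descendants is a mapped source node."""
    return node in mapping or any(contains_mapping(c, mapping)
                                  for c in node.children)


def merge_mapped(source, mapping):
    """Merge a mapped source node, and the subtree below it, into the merge tree."""
    target = mapping[source]
    target.ops.extend(source.ops)            # step 2: concatenate operations
    for child in source.children:            # step 3
        if child in mapping:
            merge_mapped(child, mapping)     # step 3.a: mapped child
        elif not contains_mapping(child, mapping):
            target.children.append(child)    # step 3.b.i: adopt whole subtree
        else:
            raise NotImplementedError("step 3.b.ii is outside this sketch")
    return target


# Merge tree A(B, C) and source tree A(B, F(G)); A and B are mapped.
m_b, m_c = Node("B"), Node("C")
merge_root = Node("A", [m_b, m_c])
s_b, s_f = Node("B"), Node("F", [Node("G")])
source_root = Node("A", [s_b, s_f])
mapping = {source_root: merge_root, s_b: m_b}

supertree = merge_mapped(source_root, mapping)
print(supertree.ops, [c.ops for c in supertree.children])
```

In the example the unmapped subtree rooted at F has no mapped descendants, so it is simply adopted under the merged root, while the two mapped pairs have their operations concatenated.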

[0051] The process is indicated in Figure 9 [...] and a source tree [...]
three mappings between nodes made by the comparison algorithm - the remaining nodes need
to be inserted appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated.

The child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of AA (see Figure 9b). The other child node of A, C,
does however have descendant mappings (D and E) [...]
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to follow
the [...] section [...]. The least common ancestor containing both mapped operations
D and E is X. C of the source tree is thus linked into the merge tree as child of AA (the parent [...]

[...] Figure 9b [...] the merging is [...] by concatenation or merging of the remaining nodes
of the source tree, all of which steps are straightforward.

The resultant supertree 203 is shown in Figure [...]. This supertree 203 acts as merge tree for the merging [...]
a supertree node - merging is thus entirely straightforward, and consists only of concatenation (ie substitution). This
process continues until [...]

remaining DAGs [...] by a backmapping process. Processes derived [...]
techniques such as [...] can be utilised for [...]
to use of Zhang's algorithm, and [...]

[...] the supertree [...]. Control information related to any
such dataflows added by this backmapping process needs to be stored also.

[...] DAGs on [...]

[...] converted into trees (including here any DAGs added [...]). The [...]
structure is a class dataflow, which represents [...]

for the supertree (for example, to determine any reconfiguration [...]

[...] purpose of determining the hardware configuration of the secondary processor [...]
[...] used to [...] structure for enabling [...]

[...]: these steps are described further below.

[...] periphery of the dataflow. The actions required [...]

[...] dataflow) with load primitives and of the output of the dataflow (root of the relevant tree) with a read [...]
[...] relevant tree are contained in the supertree; [...]

[...] labelled Input Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in [...]

[...] other I/O resources are allocated to the leaves and the root. The implicit mapping [...]
thus provides a correspondence between operation IDs of the Input Tree nodes and the I/O resources allocated for PFU
Tree. [...]

[...] it is possible to configure the secondary processor. This step can be conducted
[...] the class dataflow to a netlist (with [...]
[...] and including in appropriate form any dynamic reconfiguration instructions), and then mapping the [...] to the [...]

[...] architectures, these steps can be carried out essentially by use of appropriate known [...]

[...] can be used. Firstly, the netlist can be rendered in Xilinx netlist format (XNF). This can then be followed [...]
into configurable logic blocks and input/output blocks by the Xilinx Partition, Place & Route [...]

[...] provision of predetermined reconfiguration solutions, in [...]
[...] Reconfigurable Computer [...] Casselman, currently available on the World Wide Web [...]

[...] operated by SB Associates, Inc. of 504 Nino Avenue, Los Gatos, [...]

[...] tools to the processor concerned.

[...] generated in executable form with appropriate calls to the secondary processor and
[...] processor configuration has been determined, the source code can be loaded and [...]. The
source code is executed in the primary processor with calls to coprocessors and the secondary processor [...]

[...] of the method of this embodiment of the [...]

[...] here described is particularly effective to allow for optimal use of the secondary processor
in an architecture comprising a primary processor [...] secondary processor [...]
CHESS array

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in detail in European Patent Application No. 97300563.0, and the
ALU structure and provision of instruction to ALUs is discussed in a copending application
entitled "Reconfigurable Processor Devices" and filed on the same date as the present
application.

The CHESS array consists of a chessboard layout with alternating squares comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output Cout of one ALU can also be used as the carry input Cin of another ALU
with the effect of changing the instruction of that ALU.

The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation more complex. It is straightforward using the multiplexer
capacity of a CHESS ALU to "bypass" an operation, with [...] control resulting in
either performance of the operation or propagation of a given input.
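The two-ALU arrangement and the bypass described above can be pictured with a small behavioural sketch. It is a software model only, under the assumption that instructions are represented by name rather than by their 4-bit codes; the function names and the choice to propagate input A on bypass are illustrative and not taken from the application.

```python
MASK = 0xF  # CHESS ALUs operate on 4-bit words


def select_instruction(instr_a, instr_b, select):
    """First ALU: multiplex between two alternative instructions."""
    return instr_b if select else instr_a


def execute(instr, a, b):
    """Second ALU: execute the chosen instruction on two 4-bit operands."""
    operations = {
        "ADD": lambda x, y: (x + y) & MASK,
        "SUB": lambda x, y: (x - y) & MASK,
    }
    return operations[instr](a & MASK, b & MASK)


def bypassable(instr, a, b, enable):
    """Perform the operation when enabled, otherwise propagate input A."""
    return execute(instr, a, b) if enable else a & MASK


print(execute(select_instruction("ADD", "SUB", 0), 9, 3))
print(execute(select_instruction("ADD", "SUB", 1), 9, 3))
print(bypassable("ADD", 9, 3, enable=False))
```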



[...] is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table A2.
Instruction bits    CarryIn value 0    CarryIn value 1
0 0 0 0             XOR                NXOR
0 0 0 1             A AND B (*)        A OR B (*)
0 0 1 0             A AND B (*)        A OR B (*)
0 0 1 1             ADD
0 1 0 0             A OR B             A AND B
0 1 0 1             B                  A
0 1 1 0             A                  B
0 1 1 1             MATCH0
1 0 0 0             A NAND B           A NOR B
1 0 0 1             NOT A              NOT B
1 0 1 0             NOT B              NOT A
1 0 1 1             MATCH1
1 1 0 0
1 1 0 1
1 1 1 0             A EQUALS B
1 1 1 1             SUB

(*) One operand is complemented in each of these functions (compare Table A2); which one is not legible here.

Table A1: Instruction bits and corresponding functions
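One way to picture how the instruction bits and the carry input together select a function is a simple lookup, sketched below. The dictionary reflects one reading of Table A1 and is an assumption: rows whose entries cannot be read with certainty (0001, 0010) or are empty (1100, 1101) are omitted, and rows showing a single function are treated as selecting that function for either carry input value. The names `TABLE_A1` and `decode` are hypothetical.

```python
# Instruction bits -> (function for CarryIn = 0, function for CarryIn = 1).
# Assumed reading of Table A1; uncertain and empty rows are left out.
TABLE_A1 = {
    "0000": ("XOR", "NXOR"),
    "0011": ("ADD", "ADD"),
    "0100": ("A OR B", "A AND B"),
    "0101": ("B", "A"),
    "0110": ("A", "B"),
    "0111": ("MATCH0", "MATCH0"),
    "1000": ("A NAND B", "A NOR B"),
    "1001": ("NOT A", "NOT B"),
    "1010": ("NOT B", "NOT A"),
    "1011": ("MATCH1", "MATCH1"),
    "1110": ("A EQUALS B", "A EQUALS B"),
    "1111": ("SUB", "SUB"),
}


def decode(instruction_bits, carry_in):
    """Return the function name selected by the instruction bits and carry input."""
    return TABLE_A1[instruction_bits][carry_in]


print(decode("0000", 0), decode("0000", 1))
print(decode("0101", 0), decode("0101", 1))
```

The second line of the example shows the multiplexing between the A and B inputs: the same instruction bits pass B or A through depending only on the carry input.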
Name                U function                  Cout function
ADD                 A plus B                    Arithmetic carry
SUB                 A minus B                   Arithmetic carry
A AND B             Uj = Aj AND Bj
A OR B              Uj = Aj OR Bj               Cout = Cin
A NOR B             Uj = NOT (Aj OR Bj)         Cout = Cin
A XOR B             Uj = Aj XOR Bj              Cout = Cin
A NXOR B            Uj = NOT (Aj XOR Bj)        Cout = Cin
A AND (NOT B)       Uj = Aj AND (NOT Bj)
B AND (NOT A)       Uj = (NOT Aj) AND Bj        Cout = Cin
(NOT A) OR B        Uj = (NOT Aj) OR Bj         Cout = Cin
(NOT B) OR A        Uj = Aj OR (NOT Bj)         Cout = Cin
A                   Uj = Aj                     Cout = Cin
B                   Uj = Bj                     Cout = Cin
NOT A               Uj = NOT Aj                 Cout = Cin
NOT B               Uj = NOT Bj                 Cout = Cin
A EQUALS B          Not applicable
MATCH1              Not applicable              bitwise AND of A and B, followed by OR across the width of the word
MATCH0              Not applicable              bitwise OR of A and B, followed by an AND across the width of the word

Table A2: Outputs for instructions
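The U functions and the two MATCH functions of Table A2 can be modelled in a few lines for 4-bit words. This is a behavioural sketch only: the dictionary and function names are hypothetical, the carry outputs of ADD and SUB are not modelled, and A EQUALS B is omitted because its output cell is not given in the table.

```python
MASK = 0xF  # 4-bit word

# U output for each instruction, following the "U function" column of Table A2.
U_FUNCTIONS = {
    "ADD": lambda a, b: (a + b) & MASK,
    "SUB": lambda a, b: (a - b) & MASK,
    "A AND B": lambda a, b: a & b,
    "A OR B": lambda a, b: a | b,
    "A NOR B": lambda a, b: ~(a | b) & MASK,
    "A XOR B": lambda a, b: a ^ b,
    "A NXOR B": lambda a, b: ~(a ^ b) & MASK,
    "A": lambda a, b: a,
    "B": lambda a, b: b,
    "NOT A": lambda a, b: ~a & MASK,
    "NOT B": lambda a, b: ~b & MASK,
}


def match1(a, b):
    """Bitwise AND of A and B, then OR across the word."""
    return int((a & b & MASK) != 0)


def match0(a, b):
    """Bitwise OR of A and B, then AND across the word."""
    return int(((a | b) & MASK) == MASK)


print(U_FUNCTIONS["ADD"](9, 9), U_FUNCTIONS["SUB"](3, 5))
print(match1(0b1010, 0b0010), match0(0b1010, 0b0101))
```

SUB wraps modulo 16, which is consistent with 2s complement arithmetic on 4-bit words.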



2s complement arithmetic is used, and the arithmetic carry is provided to be consistent with
this arithmetic. The MATCH functions are so-called because for MATCH1 the value of 1 is [...]

Claims

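1. A method of compiling source code to a primary and a secondary processor, comprising:

selective extraction of dataflows from the source code;
transformation of the extracted dataflows into trees;
matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another;
determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships,
and creating for each group a generic dataflow capable of supporting each dataflow in that group;
using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor;
and
substituting into the source code calls to the secondary processor for said group or plurality of groups of
dataflows, and compiling the resultant source code to the primary processor.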

2. A method as claimed in claim 1, wherein said minimum edit cost relationships [...]
minimum edit distances for classification of the trees.

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims [...]
for reconfiguration of the secondary processor during execution of the source code.

5. A method as claimed in [...]

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim 4, wherein the secondary processor is a field programmable arithmetic array.

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group [...]

9. A method as claimed in [...]
mapping of dataflows in the group on to each other, followed by a merge operation.

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graphs not present in a critical path between
a leaf node and the root of a directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein [...]
execution time.

13. A method as claimed in any of claims 10 to 12, wherein after the creation of a generic dataflow, the generic dataflow
is compared with further dataflows extracted from the source code and provided in the manner defined in claim 10,
wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the
generic dataflow.

14. A method as claimed in [...] claims 10 or claim 13 where dependent on claim 9, wherein the removed links are
stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the
merging of the trees of the group into the generic dataflow.